Update `example-dvc-experiments` with `dvc exp init` and confusion matrix #97

iesahin · 2021-12-01T17:05:49Z

Updates https://github.com/iterative/example-dvc-experiments with

* `dvc exp init` instead of `dvc stage add`
* Generates confusion matrix data in `plots/confusion.csv`
* Generates a misclassification image in `plots/confusion.png`
* Uses MNIST instead of Fashion-MNIST because `confusion.png` is clearer that way.
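
A minimal sketch of how confusion matrix data like `plots/confusion.csv` could be written. It assumes one row per test sample with `actual` and `predicted` columns; the column names and the `write_confusion_csv` helper are illustrative and not taken from this PR.

```python
import csv
import io


def write_confusion_csv(actual, predicted, out):
    """Write one row per test sample: the true label and the predicted label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["actual", "predicted"])  # assumed header names
    for a, p in zip(actual, predicted):
        writer.writerow([a, p])


# Usage with made-up labels; a real run would pass the test labels and the
# model's predictions, and write to a file such as plots/confusion.csv.
buffer = io.StringIO()
write_confusion_csv([7, 2, 1, 0], [7, 2, 7, 0], buffer)
csv_text = buffer.getvalue()
```

Keeping one row per sample, rather than a pre-aggregated matrix, leaves the counting to whatever tool renders the plot.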

Closes #96

shcheklein · 2021-12-01T20:21:36Z

@iesahin could you please add the context in the description? why, previous discussions, etc. That would help to understand this :)

iesahin · 2021-12-02T16:49:33Z

could you please add the context in the description? why, previous discussions

I shouldn't add these just before the meeting :)

shcheklein · 2021-12-02T17:28:32Z

Thanks, Emre. It's still not enough context to meaningfully review this :(

Why do we need this repo? What is the plan - replace existing? keep both? etc?

What was the motivation behind doing this?

iesahin · 2021-12-04T13:28:00Z

Thanks, Emre. It's still not enough context to meaningfully review this :(

Why do we need this repo? What is the plan - replace existing? keep both? etc?

Actually, this was a temporary PR. I think the confusion comes from my not marking it as a draft or WIP. Sorry.

I'm testing how to generate a repository based on dvc exp init. But from my tests, the current exp init doesn't provide a much faster intro to experiments. This is because:

(a) dvc init is still needed before dvc exp init.
(b) dvc add data/ is still needed after dvc exp init.

Basically, what (the current) dvc exp init does is something like dvc stage add with some sane defaults. (dvc exp init --interactive fills the pipeline elements by asking the user.) In the current intro to experiments, we assume there is already a pipeline. If we remove that assumption and try to create a pipeline with dvc exp init, it will require more preparation to get to dvc exp run. Currently, we're hiding everything in a details section, but if we intend to start the GS:Experiments with dvc exp init, we'll first ask the user to dvc init, then dvc exp init, then dvc add data/ (which takes around 5 minutes to add 70K small files in the current dataset), and only then will they reach the point where they can dvc exp run. From our previous discussions I know you would like to see dvc exp run as the first command, or at least on the first page. This is probably not possible if we use dvc exp init to initialize the project.

Another point is that the experiment has to fit in a single stage if we use dvc exp init. In the current project, we have two stages. We dvc pull a single .tar.gz file, then the extract stage splits this into 70K individual .png files, and the train stage works with these individual files. If we use a single stage, either:

(a) we'll merge the extract stage into the train.py script, that is, the training will work on the .tar.gz file, or,
(b) we'll dvc pull 70K individual files from the remote to feed into train.py.

Option (b) proved to be too slow, will take at least 20-25 minutes to download, and I know (from our previous discussions) you don't want to work on a single file as the dataset as in option (a).

What was the motivation behind doing this?

My motivation was testing dvc exp init with the current dataset. I think we should keep the DVCLive one as the next iteration of experiments. When dvc exp init removes these dvc init and dvc add requirements, we can return to this project once more. WDYT? @shcheklein

shcheklein · 2021-12-04T18:50:23Z

cc @dberenbaum @efiop

How about a separate section that is focused more on dvc exp init itself? "Initialize Project"?

which takes around 5 minutes to add 70K small files in the current dataset

is it still the case? There were some improvements as far as I know ... could you point me to the dataset please to experiment a bit?

Option (b) proved to be too slow, will take at least 20-25 minutes to download

it seems realistically, DVC doesn't handle 70K at the moment ... at least for the quick start/get started project where speed is important

should we consider for now using something smaller/artificial? cc @dberenbaum @efiop ?

dberenbaum · 2021-12-04T19:52:05Z

What about starting by replacing the hidden Installing the example project section? Instead of cloning an existing dvc repo, the user can clone/download a stripped down git repo, and then we can show how to set up from there. The workflow can be like:

* download/clone repo with code + params.yaml + requirements.txt
* virtualenv setup
* dvc init
* dvc import data
* dvc exp init

It's a lot of steps (basically what's there now plus dvc exp init), but they are all pretty transparent or simple to explain, and it gives users an idea of how to set up their own projects. It doesn't simplify the page, but it should make it more self-contained.

No strong opinion on whether to keep this hidden or make a new section for it.

should we consider for now using something smaller/artificial? cc @dberenbaum @efiop ?

Let's check the times now, but IMO it's fine to use a subset of the data or a different data set if it still takes too long. Most users understand that tutorials use toy data to keep things moving.

iesahin · 2021-12-06T12:03:34Z

I've added some time commands to the repository generation. These run on DVC master by installing to a venv.

These are on WSL with a fairly good Windows laptop. I'll also test these on Google Cloud VM. You can test these yourselves by generating the repository with this branch: example-dvc-exp-init/generate.bash.

Some results:

time dvc add data/
+ dvc add data/
100% Adding...|████████████████████████████████████████|1/1 [10:59, 659.43s/file]

To track the changes with git, run:

    git add data.dvc .gitignore

real    11m1.916s
user    9m51.770s
sys     5m47.967s

time dvc init
...
real    0m0.682s
user    0m0.530s
sys     0m0.088s

time dvc exp init python3 src/train.py
+ dvc exp init python3 src/train.py
Created default stage in dvc.yaml. To run, use "dvc exp run".
See https://dvc.org/doc/user-guide/experiment-management/running-experiments.

real    0m1.280s
user    0m1.179s
sys     0m0.083s

The following are for dvc exp run running src/train.py. Absolute times depend on train.py, but dvc exp run --queue takes around 40 seconds, and dvc exp run --run-all --jobs 2 doesn't lead to ~50% shorter times because of dvc checkout. (Actually per experiment time is around 2x with --queue.)

dvc exp run 
real    4m29.974s
user    10m53.876s
sys     0m59.886s

time dvc exp run -n cnn-32 --queue -S model.conv_units=32
+ dvc exp run -n cnn-32 --queue -S model.conv_units=32
Queued experiment '5235904' for future execution.

real    0m42.076s
user    0m32.779s
sys     0m5.813s

time dvc exp run -n cnn-64 --queue -S model.conv_units=64
+ dvc exp run -n cnn-64 --queue -S model.conv_units=64
Queued experiment '0bf5164' for future execution.

real    0m40.619s
user    0m32.523s
sys     0m4.828s

The following is for 4 experiments, set to run 2-by-2 in parallel. Note that plain dvc exp run takes around 4 minutes, and our expected results should be around 8-9 minutes (in total) for this case.

time dvc exp run --run-all --jobs 2
...
Reproduced experiment(s): cnn-128, cnn-96, cnn-64, cnn-32
...
real    42m9.655s
user    166m22.484s
sys     15m53.164s
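
The expectation stated above can be checked with a little arithmetic. This is a rough sketch using only the `real` times from this comment; the idealized model (batches of `--jobs` experiments, each batch taking as long as one plain run, with no queueing or checkout overhead) is an assumption made for illustration.

```python
import math


def to_seconds(minutes, seconds):
    """Convert a `time`-style m/s reading to seconds."""
    return 60 * minutes + seconds


single_run = to_seconds(4, 29.974)  # plain `dvc exp run`, real time
observed = to_seconds(42, 9.655)    # `dvc exp run --run-all --jobs 2`, real time

experiments = 4
jobs = 2

# Idealized estimate: experiments run in batches of `jobs`, and each batch
# takes as long as one plain run.
batches = math.ceil(experiments / jobs)
expected = batches * single_run

slowdown = observed / expected

print(f"expected ~{expected / 60:.1f} min, observed {observed / 60:.1f} min")
print(f"observed / expected = {slowdown:.1f}x")
```

Under that model the four experiments would take roughly 9 minutes, against about 42 minutes observed on this machine.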

And finally, :)

time dvc exp show --no-pager
+ dvc exp show --no-pager
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Experiment            ┃ Created      ┃    loss ┃    acc ┃ train.epochs ┃ model.conv_units ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ workspace             │ -            │ 0.24566 │  0.908 │ 10           │ 16               │
│ baseline-experiment   │ Dec 02, 2021 │ 0.24566 │  0.908 │ 10           │ 16               │
│ ├── c9fb827 [cnn-64]  │ 03:23 PM     │ 0.23653 │ 0.9143 │ 10           │ 64               │
│ ├── f60d42c [cnn-32]  │ 03:22 PM     │ 0.23957 │  0.912 │ 10           │ 32               │
│ ├── 4141ac4 [cnn-128] │ 03:09 PM     │ 0.23462 │ 0.9174 │ 10           │ 128              │
│ └── 6a0bfa7 [cnn-96]  │ 03:05 PM     │ 0.25099 │ 0.9133 │ 10           │ 96               │
└───────────────────────┴──────────────┴─────────┴────────┴──────────────┴──────────────────┘

real    0m1.490s
user    0m0.892s
sys     0m0.189s

iesahin · 2021-12-06T12:46:41Z

The following are the time results on a Google Cloud VM:

time dvc get https://github.com/iterative/dataset-registry \
        fashion-mnist/images.tar.gz -o images.tar.gz
+ dvc get https://github.com/iterative/dataset-registry fashion-mnist/images.tar.gz -o images.tar.gz

real    0m3.536s
user    0m1.181s
sys     0m0.292s
time tar xvzf images.tar.gz
+ tar xvzf images.tar.gz

real    0m2.643s
user    0m0.916s
sys     0m2.045s
popd
+ popd

time dvc init
+ dvc init

real    0m8.066s
user    0m0.446s
sys     0m0.092s

# tag_tick
# git add .dvc
# git commit -m "Initialized DVC"
# git tag "dvc-init"
#
# dvc add data/images.tar.gz

time dvc exp init python3 src/train.py
+ dvc exp init python3 src/train.py

real    0m3.231s
user    0m1.177s
sys     0m0.141s

time dvc add data/
+ dvc add data/
100% Adding...|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [05:52, 352.93s/file]

real    5m55.554s
user    4m37.304s                                                                                                                                                                    
sys     1m1.456s

time dvc exp run
+ dvc exp run

real    7m0.122s
user    6m52.626s
sys     0m15.181s

time dvc exp run -n cnn-32 --queue -S model.conv_units=32
+ dvc exp run -n cnn-32 --queue -S model.conv_units=32

real    3m5.789s
user    0m33.405s
sys     0m5.045s
time dvc exp run -n cnn-64 --queue -S model.conv_units=64
+ dvc exp run -n cnn-64 --queue -S model.conv_units=64

real    3m6.764s
user    0m33.531s                                                                                                                                                                    
sys     0m5.153s
time dvc exp run -n cnn-96 --queue -S model.conv_units=96
+ dvc exp run -n cnn-96 --queue -S model.conv_units=96

real    3m6.632s
user    0m33.511s
sys     0m5.089s
time dvc exp run -n cnn-128 --queue -S model.conv_units=128
+ dvc exp run -n cnn-128 --queue -S model.conv_units=128

real    3m5.999s
user    0m33.134s
sys     0m5.062s

time dvc exp run --run-all --jobs 2
+ dvc exp run --run-all --jobs 2

real    39m29.080s
user    70m26.471s
sys     5m20.166s

time dvc exp show --no-pager
+ dvc exp show --no-pager

real    0m1.650s
user    0m0.960s
sys     0m0.107s

Please note the difference between the parallel dvc exp run and the serial one. Running an experiment with dvc exp run --queue or --temp takes about 2x longer per experiment.

Also, in this VM case, adding to the experiment queue takes around 3 minutes, vs 40 seconds on WSL. No other major processes were running during this test.

BTW, a plain python src/train.py takes around 4 minutes on this VM.

iesahin · 2021-12-07T09:24:10Z

What about starting by replacing the hidden Installing the example project section? Instead of cloning an existing dvc repo, the user can clone/download a stripped down git repo, and then we can show how to set up from there. The workflow can be like:

* download/clone repo with code + params.yaml + requirements.txt
* virtualenv setup
* dvc init
* dvc import data
* dvc exp init

It's a lot of steps (basically what's there now plus dvc exp init), but they are all pretty transparent or simple to explain, and it gives users an idea of how to set up their own projects. It doesn't simplify the page, but it should make it more self-contained.

This is certainly possible, though I'm not sure if it's worth it. I was expecting that dvc exp init would make this setup smoother, without the additional need for dvc init, dvc add or dvc import.

Another problem is the performance of dvc add and dvc import. To use dvc exp init, we require the experiment to have a single stage, and ideally, this single stage must use separate images in data/ as input. With 5-10 minutes to dvc add data/, or 20-30 minutes to dvc import data/, I doubt users will want to use DVC in another project even if they are patient enough to complete the hands-on tutorial.

No strong opinion on whether to keep this hidden or make a new section for it.

should we consider for now using something smaller/artificial? cc @dberenbaum @efiop ?

Let's check the times now, but IMO it's fine to use a subset of the data or a different data set if it still takes too long. Most users understand that tutorials use toy data to keep things moving.

This project is already a toy project, with less than 40 MB of data in 70K small files. No serious user would have such a small project; our intended user base works with TB-scale data in millions of files. As a user, I'm frustrated by the slowness of DVC, and I'm trying to come up with solutions to overcome this for the example projects. I believe we have more serious issues than writing a good tutorial.

Let me ask this straight, would you use DVC in a project with millions of files?

dberenbaum · 2021-12-07T19:18:20Z

This is certainly possible, though I'm not sure if it's worth it. I was expecting dvc exp init will make this setup smoother, without additional needs for dvc init, dvc add or dvc import.

Sorry, I may have given the wrong impression. Those features would be nice, but the primary purpose is to help users get started with experiments. The hope is that dvc exp init -i in particular provides a more user-friendly onboarding to experiments than dvc stage add ... that runs on for multiple lines with arcane arguments that each introduce a completely foreign concept to new users.

As far as needing additional commands, auto dvc init would be nice but is at least lightweight and self-explanatory. Auto dvc add is more important, but we still need some command for users to get the data initially, right? Does it save any steps in this particular workflow? cc @skshetry

This project is already a toy project, with less than 40 MB of data in 70K small files. No serious user would have such a small project; our intended user base works with TB-scale data in millions of files. As a user, I'm frustrated by the slowness of DVC, and I'm trying to come up with solutions to overcome this for the example projects. I believe we have more serious issues than writing a good tutorial.

Let me ask this straight, would you use DVC in a project with millions of files?

Maybe not -- I'm not really sure today. We are in the middle of changes to address these performance issues, especially for many files (not to mention we have an entirely new product being developed specifically to address this type of scenario). Please continue to comment in relevant issues in the core repo and open issues from your findings here. Maybe we can use these in dvc benchmarks. In the meantime, we still need to address docs needs.

FWIW, my experience is that I have used it for data in the 100s of GBs and found it extremely useful. I think it can feel slow and better performance would have a major impact, but I want to clarify from both personal experience and community interactions that it is useful today in real-world applications. A few points might explain this:

* TBs of data with millions of files should absolutely be part of our intended user base, but this is by no means everyone. I don't even think 70K files is a particularly small dataset (40MB isn't the performance issue here AFAIK). There are plenty of bigger and smaller datasets in all kinds of real world applications.
* I'm more likely to measure my ML project in hours than seconds. We have a toy dataset, but we also have a toy workflow. Tutorial steps need to run in seconds, real ML training workflows typically do not. I think the concerns about transferring many files are frustrations that translate to users, while I'm less concerned with things like dvc init time TBH.

While we wait for performance to improve, what other options do we have to move the docs forward?

* Keep as an archive and extract as part of the utility functions in the stage.
* Use a subset of data.
* Use a different dataset (not focused on many small files).

Any other ideas?

iesahin · 2021-12-08T10:48:17Z

Dave,

When we discussed this topic a few months ago, Ivan assured me that the core team has a plan regarding these issues. I'm in no position to decide whether that plan is feasible or not (and certainly I never intend to be a manager or criticize anyone or push the team in a certain direction), but the current situation is not impressive, and I feel frustrated when I have to describe the features of a product that I cannot use pleasantly.

Note that my concerns are the concerns of a user, not of someone who's making decisions about the project. I used to use DVC to track my personal collections, but currently I don't. When I use our own product only for professional reasons, I believe that's a red flag.

I can write tickets, but I don't think the gravity of the situation is well understood. Performance (and security) are two aspects that you cannot add to a software project later, they are not like features that you can add at a certain point. Every technical decision regarding features must be made also considering its effect on performance (and security.)

Regarding the particular changes for the example project: I think we can keep the current docs and the project until the performance issues are resolved. I can convert the project to its original format, where the images are loaded from a single file, but I believe that's not what @shcheklein would want.

shcheklein · 2021-12-09T00:59:26Z

@iesahin I think Dave and the team very clearly understand the problem and are trying to address it as fast as possible. No one was saying that performance or security are not important. I see your frustration, but this is part of building things fast in an early-stage environment. We need to adapt quicker and find some workarounds faster. Let's try to discuss some options please and try to help the team as much as we can.

@dberenbaum

Keep as an archive and extract as part of the utility functions in the stage.

that's what we already do, but this would complicate the dvc exp init, right? that was the initial concern with trying to transition the project to dvc exp init

Use a subset of data.

probably won't work either - still many files to do it quick

Use a different dataset (not focused on many small files).

what are the options here? NLP problems with DL on a text file?

dberenbaum · 2021-12-09T14:28:41Z

@iesahin Are you following https://github.com/iterative/dvc-bench? @efiop and others are already working on improvements there, but your input can be helpful.

Keep as an archive and extract as part of the utility functions in the stage.

@shcheklein What I mean here is that we have a single stage, and inside train.py it does the extraction on the fly.

shcheklein · 2021-12-09T17:57:42Z

@dberenbaum got it, @iesahin is it feasible? (I remember we had some code that was reading the archive "on the fly") ... maybe we could even use something like hd5?

shcheklein · 2021-12-09T18:00:28Z

Or TensorFlow datasets/formats that package data?

iesahin · 2021-12-13T15:28:29Z

The earliest version of this project was using MNIST's custom image format to obtain the images on the fly. (It was in IDX format and generating NumPy arrays from them.) We can revert to that if it sounds good.

Another option is to use a single tar file that contains PNG images. Python supports tar in the standard library.

We can also convert the project to a single file NLP project, similar to example-get-started, but I don't think it's necessary, and either of the above two approaches will probably suffice.
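
A minimal sketch of the tar option, using only the standard library `tarfile` module. The archive layout (`train/<label>/<name>.png`), the placeholder payloads and the helper names are assumptions made for illustration; a real `train.py` would decode each PNG into an array with an image library.

```python
import io
import tarfile


def iter_png_members(tar_bytes):
    """Yield (name, raw_bytes) for every .png file inside a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".png"):
                yield member.name, archive.extractfile(member).read()


def build_demo_archive():
    """Build a tiny gzipped tar in memory so the sketch runs without files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in [("train/0/a.png", b"fake-png-0"),
                              ("train/1/b.png", b"fake-png-1"),
                              ("README.txt", b"not an image")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Read the images straight from the archive, without extracting to disk.
images = dict(iter_png_members(build_demo_archive()))
```

Reading members directly like this keeps the dataset as a single tracked file, so the many-small-files cost of `dvc add` or `dvc pull` does not apply.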

@shcheklein @dberenbaum

shcheklein · 2021-12-14T00:00:54Z

Sounds good, Emre. Probably it's better to use tar or the TensorFlow format; the custom MNIST format is too specific, I guess.

iesahin · 2021-12-14T15:54:17Z

Sounds good, Emre. Probably it's better to use tar or the TensorFlow format; the custom MNIST format is too specific, I guess.

Thinking about this in the initial version, I've decided that "going with the default, as distributed from the dataset website" is more "excusable." Though it's easier to make a classifier that way, I don't like to use Tensorflow datasets, as we have the corresponding functionality in the dataset-registry.

I'll use the tar version, if that's OK? @shcheklein

iesahin · 2022-01-19T14:14:12Z

This is ready for review and merge. @shcheklein @dberenbaum

dberenbaum · 2022-01-20T20:45:46Z

example-dvc-experiments/code/src/train.py

+    return (training_images, training_labels, testing_images, testing_labels)
+
+
+def create_image_matrix(cells):


While this is a cool visual, it does add quite a bit to the example code. What about using some built-in visualization, like https://keras.io/api/utils/model_plotting_utils/#plot_model-function? It's probably much less useful, but it's certainly less example code. Not a strong opinion if others prefer this visual.

IMO, our model structure is a bit static and not-so-exciting. It's very simple to visualize TBH 😅

I can think of other, simpler ways. This came to my mind as a "confusion matrix" in image form. (I was thinking to put all classification errors in tiny little boxes on a large image, but one from each "confusion" looked better.)
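
The "one from each confusion" idea can be sketched as follows. This is an illustrative helper, not the `create_image_matrix` code from the PR: it only picks which sample would fill each (true label, predicted label) cell and leaves the drawing aside.

```python
def one_example_per_confusion(actual, predicted):
    """Map each misclassified (true, predicted) pair to the index of the
    first sample where that confusion occurs."""
    examples = {}
    for index, (a, p) in enumerate(zip(actual, predicted)):
        if a != p and (a, p) not in examples:
            examples[(a, p)] = index
    return examples


# Made-up labels for illustration.
actual = [3, 5, 3, 8, 5, 3]
predicted = [3, 3, 8, 8, 3, 8]
cells = one_example_per_confusion(actual, predicted)
```

Each key names a cell of the matrix and each value points at the image to draw there; correctly classified samples are skipped.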

The code is not that visible to the users. I don't think they'll peek inside the code that much. We'll just show the results and (maybe) link to the code on GitHub.

Moved this code to util.py. I think we can consider this resolved.


shcheklein · 2022-01-22T21:09:15Z

example-dvc-experiments/generate.bash

@@ -4,7 +4,7 @@ set -veux

 HERE="$( cd "$(dirname "$0")" ; pwd -P )"
 export HERE
-PROJECT_NAME="example-dvc-experiments"
+PROJECT_NAME="example-dvc-staging"


should we rename it back?

which one should be permanent? I usually create & use (private) example-dvc-staging more frequently than (public) example-dvc-experiments and pushing the example repository with the created script is very easy, and may be used mistakenly to push versions with bugs, etc.

I don't have strong opinions here, the user must edit this line before generating the script.

Can we rename the repository itself after building and pushing? We can rename example-dvc-experiments to example-dvc-experiments-22-01-26 and archive, then rename example-dvc-staging to example-dvc-experiments, and make public. Is it too much work?


shcheklein · 2022-01-22T21:10:01Z

example-dvc-experiments/generate.bash

 cp "${HERE}"/code/params.yaml .
+## We are assuming the repo is generated in Linux
+## Otherwise the following line must be changed to have requirements-macos.txt


can we detect this?

Yes, we can detect it. Would you like to support Windows as well?

Added support with uname -s for macOS only. If you'd like Windows support, let me know.
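
For comparison, the same check written in Python. This is only an analogue of the `uname -s` test in generate.bash; the `requirements_file` helper is hypothetical.

```python
import platform


def requirements_file(system_name=None):
    """Return the requirements file to install for the given OS name."""
    if system_name is None:
        # platform.system() reports "Darwin" on macOS, like `uname -s`.
        system_name = platform.system()
    if system_name == "Darwin":
        return "requirements-macos.txt"
    return "requirements.txt"


chosen = requirements_file()
```

Passing the system name as a parameter makes the selection easy to exercise on any machine.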


shcheklein

Looks almost good, some cleanup is required

… commands

iesahin · 2022-02-01T13:05:14Z

example-dvc-experiments/generate.bash

 cp "${HERE}"/code/params.yaml .
-pip install -r "${REPO_PATH}"/requirements.txt
+if [[ $(uname -s) == 'Darwin' ]] ; then
+    pip install -r "${REPO_PATH}"/requirements-macos.txt


It looks like macOS requires the whole generation to depend on conda on M1 Macs. This can be solved, but it requires a generate-macos.bash that will use conda instead of pip. I can work on this, WDYT? @shcheklein

Keeping this as is for future, maybe we'll be able to install TF with pip on macs someday.

iesahin added 2 commits December 1, 2021 14:09

copied from exampl-dvc-experiments

44c8861

fixed get_model params and added confusion matrix

d2b508d

iesahin self-assigned this Dec 1, 2021

iesahin added 5 commits December 4, 2021 12:55

moved exp init a few lines below

8105c2c

add data

27f79da

add a commit for artifacts

aeb12d4

git status for debugging

3ba6a3b

comment out git add to understand what has changed

0680566

removed read stmts

fdbbc2d

added time to measure running times

f137dd6

iesahin mentioned this pull request Dec 8, 2021

example-dvc-experiments: Added modifications for DVCLive #95

Closed

adding untar_dataset function

31a871b

iesahin added 5 commits January 6, 2022 19:11

modified to copy requirements-macos.txt separately

a6ff65f

modified to build with mnist instead of fashion-mnist

ee6a7aa

updated colors for misclassification

a189b12

requirements now contains dvc[all]

0463c7c

requirements now contains dvc[all]

04a55d7

iesahin changed the title ~~Add example-dvc-exp-init repository generation~~ Update example-dvc-experiments with dvc exp init and confusion matrix Jan 12, 2022

Merge branch 'master' into iesahin/example-dvc-exp-init

6f76e9b

iesahin mentioned this pull request Jan 19, 2022

Various issues in example-dvc-experiments #98

Closed

dberenbaum reviewed Jan 20, 2022
